This project clusters exoplanets using machine learning in order to gain insight into potentially habitable planets outside of our solar system. Since even the closest planets outside the solar system are unreachable, there is no way to classify a planet as habitable or not with certainty. However, some key properties are considered necessary for habitability. The aim of this project is to understand how grouping planets, without computing any habitability index, could affect filtering based on habitability. The dataset used for this project originates from NASA's Exoplanet Archive and is clustered for further analysis. The results could be used to filter out false positives, saving the time and resources that would otherwise be spent on those planets.
Problem
Data Understanding
Data Preparation
Modeling
Evaluation
References
Since the tools used to identify exoplanets are biased, it seems reasonable to begin the analysis without any prior knowledge about the planets, so that the algorithm only groups planets that correlate. From these correlations, more insight can be gained by locating key planets that are known to be habitable or close to habitable, such as Earth and Mars. However, these results should not be taken at face value; they exist simply to filter out unwanted planets or false positives as much as possible.
Imports required for the notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
import scipy.stats as stats
import warnings
import math
from matplotlib.lines import Line2D
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial import ConvexHull
from scipy import interpolate
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.preprocessing import normalize
from sklearn.neighbors import KernelDensity
from sklearn.metrics.cluster import rand_score
from sklearn.decomposition import PCA
Reading the data, suppressing the chained-assignment warning (chained assignment was measured to run six to seven times faster), suppressing future warnings, and setting the DPI of the figures
warnings.simplefilter(action='ignore', category=FutureWarning)
df = pd.read_csv("exoplanets.csv",header=0, index_col=0)
pd.set_option('mode.chained_assignment', None)
mpl.rcParams['figure.dpi'] = 350
The dataset comes from NASA's Exoplanet Archive; the Kaggle user "SATHYANARAYAN RAO" compiled it into a comma-separated values (.csv) file.
The dataset can be reached from the link here (.csv format)
The data contains many fields used for labelling, which are outside the interest of the project. A lot of information about the host star of each system is also present and could be used for further analysis; however, since the scope of the project is focused on the attributes of the planets, it is of no direct use here. There are also some columns for which no suitable definition could be found.
You can reach the column analysis spreadsheet from here
df.head(10)
| | pl_name | hostname | default_flag | sy_snum | sy_pnum | discoverymethod | disc_year | disc_facility | soltype | pl_controv_flag | ... | sy_vmagerr2 | sy_kmag | sy_kmagerr1 | sy_kmagerr2 | sy_gaiamag | sy_gaiamagerr1 | sy_gaiamagerr2 | rowupdate | pl_pubdate | releasedate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 Com b | 11 Com | 1 | 2 | 1 | Radial Velocity | 2007 | Xinglong Station | Published Confirmed | 0 | ... | -0.023 | 2.282 | 0.346 | -0.346 | 4.44038 | 0.003848 | -0.003848 | 2014-05-14 | 2008-01 | 2014-05-14 |
| 1 | 11 Com b | 11 Com | 0 | 2 | 1 | Radial Velocity | 2007 | Xinglong Station | Published Confirmed | 0 | ... | -0.023 | 2.282 | 0.346 | -0.346 | 4.44038 | 0.003848 | -0.003848 | 2014-07-23 | 2011-08 | 2014-07-23 |
| 2 | 11 UMi b | 11 UMi | 0 | 1 | 1 | Radial Velocity | 2009 | Thueringer Landessternwarte Tautenburg | Published Confirmed | 0 | ... | -0.005 | 1.939 | 0.270 | -0.270 | 4.56216 | 0.003903 | -0.003903 | 2018-04-25 | 2011-08 | 2014-07-23 |
| 3 | 11 UMi b | 11 UMi | 1 | 1 | 1 | Radial Velocity | 2009 | Thueringer Landessternwarte Tautenburg | Published Confirmed | 0 | ... | -0.005 | 1.939 | 0.270 | -0.270 | 4.56216 | 0.003903 | -0.003903 | 2018-09-04 | 2017-03 | 2018-09-06 |
| 4 | 11 UMi b | 11 UMi | 0 | 1 | 1 | Radial Velocity | 2009 | Thueringer Landessternwarte Tautenburg | Published Confirmed | 0 | ... | -0.005 | 1.939 | 0.270 | -0.270 | 4.56216 | 0.003903 | -0.003903 | 2018-04-25 | 2009-10 | 2014-05-14 |
| 5 | 14 And b | 14 And | 0 | 1 | 1 | Radial Velocity | 2008 | Okayama Astrophysical Observatory | Published Confirmed | 0 | ... | -0.023 | 2.331 | 0.240 | -0.240 | 4.91781 | 0.002826 | -0.002826 | 2014-07-23 | 2011-08 | 2014-07-23 |
| 6 | 14 And b | 14 And | 1 | 1 | 1 | Radial Velocity | 2008 | Okayama Astrophysical Observatory | Published Confirmed | 0 | ... | -0.023 | 2.331 | 0.240 | -0.240 | 4.91781 | 0.002826 | -0.002826 | 2014-05-14 | 2008-12 | 2014-05-14 |
| 7 | 14 Her b | 14 Her | 0 | 1 | 2 | Radial Velocity | 2002 | W. M. Keck Observatory | Published Confirmed | 0 | ... | -0.023 | 4.714 | 0.016 | -0.016 | 6.38300 | 0.000351 | -0.000351 | 2021-09-20 | 2021-05 | 2021-09-20 |
| 8 | 14 Her b | 14 Her | 0 | 1 | 2 | Radial Velocity | 2002 | W. M. Keck Observatory | Published Confirmed | 0 | ... | -0.023 | 4.714 | 0.016 | -0.016 | 6.38300 | 0.000351 | -0.000351 | 2018-04-25 | 2003-01 | 2014-08-21 |
| 9 | 14 Her b | 14 Her | 0 | 1 | 2 | Radial Velocity | 2002 | W. M. Keck Observatory | Published Confirmed | 0 | ... | -0.023 | 4.714 | 0.016 | -0.016 | 6.38300 | 0.000351 | -0.000351 | 2018-04-25 | 2008-04 | 2014-08-21 |
10 rows × 92 columns
First, rows missing 50 or more of their fields (more than half of their data) are dropped; since they make up only a small fraction of the dataset, the impact is negligible.
Then, the columns that are not useful for the model are dropped; refer to Data Understanding for details about the columns.
Finally, the null values in the remaining data are analysed and filled. KNN imputation appears to fill the missing data consistently, judging by the minimal change in the spread of the data before and after imputation.
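As a minimal sketch of the imputation check described above, the snippet below scales a small made-up table, fills its gaps with `KNNImputer`, and compares the spread of each column before and after. The toy values and `n_neighbors=2` are assumptions for illustration only; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy frame standing in for the planet table (all values are made up).
toy = pd.DataFrame({
    "pl_rade": [1.0, 2.0, np.nan, 11.0, 12.0],
    "pl_orbper": [365.0, 400.0, 380.0, 4000.0, np.nan],
    "st_teff": [5780.0, 5600.0, 5700.0, 4800.0, 4900.0],
})

# Scale first so that no single column dominates the neighbour distances.
# MinMaxScaler ignores NaN while fitting and keeps it in the output.
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# Fill each gap with the column mean over the 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(scaled), columns=toy.columns)

# Compare the spread of every column before and after imputation.
spread = pd.DataFrame({"before": scaled.std(), "after": filled.std()})
print(spread)
```

A large gap between the two columns of `spread` would suggest that the imputation is distorting the distribution of that field.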
def get_empty_col(row_index):
    # Count the columns that are NaN for the given row
    total_nan = 0
    for col in df.columns:
        if pd.isna(df[col][row_index]):
            total_nan += 1
    return total_nan
def get_cols_info():
    n_col = 6
    fig, axes = plt.subplots(nrows=(len(df.columns) // n_col), ncols=n_col, figsize=(18,18))
    ax_index = 0
    fig.suptitle('Distribution of Columns Visualized')
    for col in df.columns:
        sample = df[col].sort_values()
        #attributes
        mean = sample.mean()
        std = sample.std()
        col_max = sample.max()
        col_min = sample.min()
        sem = sample.sem()
        #center the sample, so the reference normal curve is centered at 0 as well
        sample -= mean
        pdf = stats.norm.pdf(sample, 0, std)
        ax = axes[ax_index // n_col][ax_index % n_col]
        #normalize plot
        ax.hist(sample, bins=20, density=True, label="Data", color='gray')
        ax.plot(sample, pdf, 'r-', label="PDF")
        sns.kdeplot(data=sample,ax=ax, label = 'KDE')
        ax.legend(loc='best')
        x_ticks = np.arange(-4*std, 4.1*std, std)
        x_labels = [f"{i}std" for i in range(-4,5)]
        ax.grid(True, alpha=0.3, linestyle="--")
        ax.set_title(col)
        ax_index += 1
    #box plot
    fig.tight_layout()
    df.loc[:,~df.columns.str.contains('err') & ~df.columns.str.contains('flag')].plot(kind='box',title='Box Plot', showfliers = False, figsize= (22,6))
# Count the missing fields of every row in one vectorized pass
df['empty_col_count'] = df.isna().sum(axis=1)
df = df.sort_values(['empty_col_count','rowupdate'], ascending=[True,True]).drop_duplicates('pl_name').sort_index()
print('Max:\t', df['empty_col_count'].max())
print('Min:\t',df['empty_col_count'].min())
print('Total:\t',len(df.columns))
Max:     61
Min:     0
Total:   93
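The deduplication above keeps, for every planet, the entry with the fewest missing fields. The following sketch shows the same idea on a made-up three-row frame; the values (including the `pl_rade` of 12.1) are hypothetical and only serve to make the behaviour visible.

```python
import numpy as np
import pandas as pd

# Two catalogue entries for the same planet plus one unrelated planet (made-up values).
toy = pd.DataFrame({
    "pl_name": ["11 Com b", "11 Com b", "14 And b"],
    "pl_rade": [np.nan, 12.1, np.nan],
    "pl_orbper": [326.03, 326.03, 185.84],
    "rowupdate": ["2014-05-14", "2014-07-23", "2014-05-14"],
})

# Number of missing fields per row.
toy["empty_col_count"] = toy.isna().sum(axis=1)

# Sort so the most complete entry comes first, then keep one row per planet.
deduped = (
    toy.sort_values(["empty_col_count", "rowupdate"])
       .drop_duplicates("pl_name")
       .sort_index()
)
print(deduped)
```

Because `drop_duplicates` keeps the first occurrence, the sort order decides which entry survives; ties on completeness are broken by `rowupdate`.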
df = df[df['empty_col_count'] < 50]
df.head()
| | pl_name | hostname | default_flag | sy_snum | sy_pnum | discoverymethod | disc_year | disc_facility | soltype | pl_controv_flag | ... | sy_kmag | sy_kmagerr1 | sy_kmagerr2 | sy_gaiamag | sy_gaiamagerr1 | sy_gaiamagerr2 | rowupdate | pl_pubdate | releasedate | empty_col_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 Com b | 11 Com | 1 | 2 | 1 | Radial Velocity | 2007 | Xinglong Station | Published Confirmed | 0 | ... | 2.282 | 0.346 | -0.346 | 4.44038 | 0.003848 | -0.003848 | 2014-05-14 | 2008-01 | 2014-05-14 | 16 |
| 4 | 11 UMi b | 11 UMi | 0 | 1 | 1 | Radial Velocity | 2009 | Thueringer Landessternwarte Tautenburg | Published Confirmed | 0 | ... | 1.939 | 0.270 | -0.270 | 4.56216 | 0.003903 | -0.003903 | 2018-04-25 | 2009-10 | 2014-05-14 | 16 |
| 6 | 14 And b | 14 And | 1 | 1 | 1 | Radial Velocity | 2008 | Okayama Astrophysical Observatory | Published Confirmed | 0 | ... | 2.331 | 0.240 | -0.240 | 4.91781 | 0.002826 | -0.002826 | 2014-05-14 | 2008-12 | 2014-05-14 | 24 |
| 7 | 14 Her b | 14 Her | 0 | 1 | 2 | Radial Velocity | 2002 | W. M. Keck Observatory | Published Confirmed | 0 | ... | 4.714 | 0.016 | -0.016 | 6.38300 | 0.000351 | -0.000351 | 2021-09-20 | 2021-05 | 2021-09-20 | 17 |
| 19 | 16 Cyg B b | 16 Cyg B | 0 | 3 | 1 | Radial Velocity | 1996 | Multiple Observatories | Published Confirmed | 0 | ... | 4.651 | 0.016 | -0.016 | 6.06428 | 0.000603 | -0.000603 | 2018-04-25 | 2007-09 | 2014-09-18 | 17 |
5 rows × 93 columns
Earth, Venus and Mars are appended as reference planets, and the data is reindexed in the same step
earth = {'pl_name':'Earth','hostname':'Sol','sy_snum':1,'sy_pnum': 8,'pl_orbper': 365,'pl_orbsmax': 1,'pl_rade': 1,'pl_bmasse': 1,'pl_bmassj': 0.0031,'pl_orbeccen': 0.0167,'pl_eqt': 255,'ttv_flag':1 ,'st_teff': 5780,'st_rad':1,'st_mass': 1,'st_met': 0.012,'st_logg': 4.438}
venus = {'pl_name':'Venus','hostname':'Sol','sy_snum':1,'sy_pnum': 8,'pl_orbper': 225,'pl_orbsmax': 0.723,'pl_rade': 0.949,'pl_bmasse': 0.815,'pl_bmassj': 0.00256,'pl_orbeccen': 0.006772,'pl_eqt': 226,'ttv_flag':1,'st_teff': 5780,'st_rad':1,'st_mass': 1,'st_met': 0.012,'st_logg': 4.438}
mars = {'pl_name':'Mars','hostname':'Sol','sy_snum':1,'sy_pnum': 8,'pl_orbper': 687,'pl_orbsmax': 1.524,'pl_rade': 0.532,'pl_bmasse': 0.107,'pl_bmassj': 0.000338,'pl_orbeccen': 0.093,'pl_eqt': 210,'ttv_flag':1,'st_teff': 5780,'st_rad':1,'st_mass': 1,'st_met': 0.012,'st_logg': 4.438}
# DataFrame.append was removed in pandas 2.0; pd.concat adds the same three rows
df = pd.concat([df, pd.DataFrame([earth, venus, mars])], ignore_index=True)
df.head()
| | pl_name | hostname | default_flag | sy_snum | sy_pnum | discoverymethod | disc_year | disc_facility | soltype | pl_controv_flag | ... | sy_kmag | sy_kmagerr1 | sy_kmagerr2 | sy_gaiamag | sy_gaiamagerr1 | sy_gaiamagerr2 | rowupdate | pl_pubdate | releasedate | empty_col_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 Com b | 11 Com | 1.0 | 2 | 1 | Radial Velocity | 2007.0 | Xinglong Station | Published Confirmed | 0.0 | ... | 2.282 | 0.346 | -0.346 | 4.44038 | 0.003848 | -0.003848 | 2014-05-14 | 2008-01 | 2014-05-14 | 16.0 |
| 1 | 11 UMi b | 11 UMi | 0.0 | 1 | 1 | Radial Velocity | 2009.0 | Thueringer Landessternwarte Tautenburg | Published Confirmed | 0.0 | ... | 1.939 | 0.270 | -0.270 | 4.56216 | 0.003903 | -0.003903 | 2018-04-25 | 2009-10 | 2014-05-14 | 16.0 |
| 2 | 14 And b | 14 And | 1.0 | 1 | 1 | Radial Velocity | 2008.0 | Okayama Astrophysical Observatory | Published Confirmed | 0.0 | ... | 2.331 | 0.240 | -0.240 | 4.91781 | 0.002826 | -0.002826 | 2014-05-14 | 2008-12 | 2014-05-14 | 24.0 |
| 3 | 14 Her b | 14 Her | 0.0 | 1 | 2 | Radial Velocity | 2002.0 | W. M. Keck Observatory | Published Confirmed | 0.0 | ... | 4.714 | 0.016 | -0.016 | 6.38300 | 0.000351 | -0.000351 | 2021-09-20 | 2021-05 | 2021-09-20 | 17.0 |
| 4 | 16 Cyg B b | 16 Cyg B | 0.0 | 3 | 1 | Radial Velocity | 1996.0 | Multiple Observatories | Published Confirmed | 0.0 | ... | 4.651 | 0.016 | -0.016 | 6.06428 | 0.000603 | -0.000603 | 2018-04-25 | 2007-09 | 2014-09-18 | 17.0 |
5 rows × 93 columns
drop_columns = [
'default_flag',
'discoverymethod',
'disc_year',
'disc_facility',
'soltype',
'pl_controv_flag',
'pl_refname',
'sy_refname',
'st_refname',
'rastr',
'ra',
'decstr',
'dec',
'st_metratio',
'pl_pubdate',
'releasedate',
'pl_radjlim',
'pl_bmassprov',
'pl_bmasselim',
'pl_bmassjlim',
'pl_orbperlim',
'pl_orbsmaxlim',
'pl_radelim',
'pl_orbeccenlim',
'pl_insollim',
'pl_eqtlim',
'st_tefflim',
'st_radlim',
'st_masslim',
'st_metlim',
'st_logglim',
'pl_bmasseerr2',
'pl_bmassjerr2',
'pl_orbsmaxerr2',
'st_tefferr2',
'st_raderr2',
'st_masserr2',
'st_meterr2',
'st_loggerr2',
'sy_vmagerr2',
'sy_kmagerr2',
'sy_gaiamagerr2',
'pl_radeerr2',
'pl_orbpererr2',
'pl_insolerr2',
'sy_disterr2',
'empty_col_count'
]
for name, val in df.items():
    na_ratio = val.isnull().sum() / len(val)
    if na_ratio > 0.7:
        print(f"{name:<16} {na_ratio*100:.3f}%")
        drop_columns.append(name)
pl_radjerr1      70.065%
pl_radjerr2      70.065%
pl_orbeccenerr1  75.375%
pl_orbeccenerr2  75.375%
pl_eqterr1       78.415%
pl_eqterr2       78.415%
st_spectype      78.091%
df = df.drop(columns=drop_columns)
df.head(10)
| | pl_name | hostname | sy_snum | sy_pnum | pl_orbper | pl_orbpererr1 | pl_orbsmax | pl_orbsmaxerr1 | pl_rade | pl_radeerr1 | ... | st_loggerr1 | sy_dist | sy_disterr1 | sy_vmag | sy_vmagerr1 | sy_kmag | sy_kmagerr1 | sy_gaiamag | sy_gaiamagerr1 | rowupdate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 Com b | 11 Com | 2 | 1 | 326.0300 | 0.3200 | 1.290 | 0.050 | NaN | NaN | ... | 0.10 | 93.1846 | 1.92380 | 4.72307 | 0.023 | 2.282 | 0.346 | 4.44038 | 0.003848 | 2014-05-14 |
| 1 | 11 UMi b | 11 UMi | 1 | 1 | 516.2200 | 3.2500 | 1.540 | 0.070 | NaN | NaN | ... | 0.15 | 125.3210 | 1.97650 | 5.01300 | 0.005 | 1.939 | 0.270 | 4.56216 | 0.003903 | 2018-04-25 |
| 2 | 14 And b | 14 And | 1 | 1 | 185.8400 | 0.2300 | 0.830 | NaN | NaN | NaN | ... | 0.07 | 75.4392 | 0.71400 | 5.23133 | 0.023 | 2.331 | 0.240 | 4.91781 | 0.002826 | 2014-05-14 |
| 3 | 14 Her b | 14 Her | 1 | 2 | 1766.4100 | 0.6700 | 2.830 | 0.041 | NaN | NaN | ... | 0.03 | 17.9323 | 0.00730 | 6.61935 | 0.023 | 4.714 | 0.016 | 6.38300 | 0.000351 | 2021-09-20 |
| 4 | 16 Cyg B b | 16 Cyg B | 3 | 1 | 799.5000 | 0.6000 | 1.680 | 0.030 | NaN | NaN | ... | 0.10 | 21.1397 | 0.01100 | 6.21500 | 0.016 | 4.651 | 0.016 | 6.06428 | 0.000603 | 2018-04-25 |
| 5 | 17 Sco b | 17 Sco | 1 | 1 | 578.3800 | 2.0100 | 1.450 | 0.020 | NaN | NaN | ... | 0.04 | 124.9530 | 2.59000 | 5.22606 | 0.023 | 2.094 | 0.244 | 4.75429 | 0.005055 | 2021-10-25 |
| 6 | 18 Del b | 18 Del | 2 | 1 | 993.3000 | 3.2000 | 2.600 | NaN | NaN | NaN | ... | 0.06 | 76.2220 | 0.62170 | 5.51048 | 0.023 | 3.366 | 0.204 | 5.27476 | 0.002654 | 2014-05-14 |
| 7 | 1RXS J160929.1-210524 b | 1RXS J160929.1-210524 | 1 | 1 | NaN | NaN | 330.000 | NaN | NaN | NaN | ... | 0.14 | 139.1350 | 1.33200 | 12.61800 | 0.069 | 8.916 | 0.021 | 12.05720 | 0.002275 | 2015-04-01 |
| 8 | 24 Boo b | 24 Boo | 1 | 1 | 30.3506 | 0.0078 | 0.190 | 0.012 | NaN | NaN | ... | 0.10 | 95.9863 | 0.63685 | 5.59000 | 0.001 | 3.159 | 0.280 | 5.33390 | 0.002000 | 2018-04-25 |
| 9 | 24 Sex b | 24 Sex | 1 | 2 | 452.8000 | 2.1000 | 1.333 | 0.004 | NaN | NaN | ... | 0.10 | 72.0691 | 0.68540 | 6.45350 | 0.023 | 4.285 | 0.016 | 6.20374 | 0.000498 | 2014-05-14 |
10 rows × 39 columns
for name, val in df.items():
    na_ratio = val.isnull().sum() / len(val)
    if na_ratio > 0.3:
        print(f"{name:<16} {na_ratio*100:.3f}%")
pl_orbsmaxerr1   58.289%
pl_radj          69.964%
pl_bmasse        60.559%
pl_bmasseerr1    63.113%
pl_bmassj        60.559%
pl_bmassjerr1    63.133%
pl_insol         41.021%
pl_insolerr1     41.488%
for name, val in df.items():
    na_ratio = val.isnull().sum() / len(val)
    if na_ratio < 0.3 and na_ratio != 0:
        print(f"{name:<16} {na_ratio*100:.3f}%")
pl_orbper        1.317%
pl_orbpererr1    2.209%
pl_orbsmax       6.769%
pl_rade          20.389%
pl_radeerr1      20.551%
pl_orbeccen      14.714%
pl_eqt           27.949%
st_teff          2.777%
st_tefferr1      4.054%
st_rad           5.432%
st_raderr1       6.303%
st_mass          0.608%
st_masserr1      3.324%
st_met           7.256%
st_meterr1       9.931%
st_logg          7.013%
st_loggerr1      7.803%
sy_dist          2.250%
sy_disterr1      4.236%
sy_vmag          0.872%
sy_vmagerr1      0.932%
sy_kmag          0.912%
sy_kmagerr1      1.480%
sy_gaiamag       1.743%
sy_gaiamagerr1   1.743%
rowupdate        0.081%
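The per-column loops above can also be written without explicit iteration. The sketch below, on a made-up frame, reuses the 0.7 and 0.3 thresholds from the cells above; the variable names `to_drop`, `heavy` and `light` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up frame with columns at different levels of completeness.
toy = pd.DataFrame({
    "pl_rade": [1.0, np.nan, np.nan, np.nan],    # 75% missing
    "pl_insol": [1.0, np.nan, np.nan, 0.4],      # 50% missing
    "pl_orbper": [365.0, 225.0, np.nan, 687.0],  # 25% missing
    "sy_pnum": [8, 8, 8, 8],                     # complete
})

# Share of missing values per column, computed in one pass.
na_ratio = toy.isna().mean()

to_drop = na_ratio[na_ratio > 0.7].index.tolist()        # too sparse to keep
heavy = na_ratio[(na_ratio > 0.3) & (na_ratio <= 0.7)]   # kept, but heavily imputed
light = na_ratio[(na_ratio > 0) & (na_ratio < 0.3)]      # kept, lightly imputed

print(to_drop)
print((heavy * 100).round(3))
print((light * 100).round(3))
```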
df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4934 entries, 0 to 4933
Columns: 39 entries, pl_name to rowupdate
dtypes: float64(33), int64(3), object(3)
memory usage: 1.5+ MB
df.select_dtypes(object).value_counts()
pl_name hostname rowupdate
11 Com b 11 Com 2014-05-14 1
Kepler-299 b Kepler-299 2014-11-18 1
Kepler-300 b Kepler-300 2014-11-18 1
Kepler-30 d Kepler-30 2014-05-14 1
Kepler-30 c Kepler-30 2014-05-14 1
..
KELT-21 b KELT-21 2018-02-01 1
KELT-20 b KELT-20 2019-12-02 1
KELT-2 A b KELT-2 A 2019-03-18 1
KELT-19 A b KELT-19 A 2018-01-08 1
xi Aql b xi Aql 2014-05-14 1
Length: 4930, dtype: int64
label_df = df[['pl_name','hostname']]
df = df.drop(['pl_name','hostname','rowupdate'], axis=1)
label_df.head()
| | pl_name | hostname |
|---|---|---|
| 0 | 11 Com b | 11 Com |
| 1 | 11 UMi b | 11 UMi |
| 2 | 14 And b | 14 And |
| 3 | 14 Her b | 14 Her |
| 4 | 16 Cyg B b | 16 Cyg B |
scaler = MinMaxScaler()
# Scale every column to the [0, 1] range; NaN values stay NaN for the imputer
df = pd.DataFrame(scaler.fit_transform(df), columns = df.columns, index=df.index)
df.head(10)
| | sy_snum | sy_pnum | pl_orbper | pl_orbpererr1 | pl_orbsmax | pl_orbsmaxerr1 | pl_rade | pl_radeerr1 | pl_radj | pl_bmasse | ... | st_logg | st_loggerr1 | sy_dist | sy_disterr1 | sy_vmag | sy_vmagerr1 | sy_kmag | sy_kmagerr1 | sy_gaiamag | sy_gaiamagerr1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.333333 | 0.000000 | 8.105728e-07 | 6.808510e-10 | 0.000171 | 9.606148e-06 | NaN | NaN | NaN | 0.348966 | ... | 0.177419 | 0.090909 | 0.010443 | 0.000517 | 0.094509 | 0.006569 | 0.169975 | 0.033554 | 0.087725 | 0.059038 |
| 1 | 0.000000 | 0.000000 | 1.283682e-06 | 6.914894e-09 | 0.000204 | 1.344861e-05 | NaN | NaN | NaN | 0.188874 | ... | 0.073314 | 0.136364 | 0.014095 | 0.000531 | 0.101625 | 0.001194 | 0.159029 | 0.025942 | 0.094780 | 0.059919 |
| 2 | 0.000000 | 0.000000 | 4.618415e-07 | 4.893617e-10 | 0.000110 | NaN | NaN | NaN | NaN | 0.086341 | ... | 0.224340 | 0.063636 | 0.008426 | 0.000192 | 0.106983 | 0.006569 | 0.171539 | 0.022937 | 0.115386 | 0.042849 |
| 3 | 0.000000 | 0.142857 | 4.393608e-06 | 1.425532e-09 | 0.000376 | 7.877041e-06 | NaN | NaN | NaN | 0.087244 | ... | 0.488270 | 0.027273 | 0.001890 | 0.000002 | 0.141046 | 0.006569 | 0.247590 | 0.000501 | 0.200276 | 0.003632 |
| 4 | 0.666667 | 0.000000 | 1.988359e-06 | 1.276596e-09 | 0.000223 | 5.763689e-06 | NaN | NaN | NaN | 0.030219 | ... | 0.476540 | 0.090909 | 0.002255 | 0.000003 | 0.131123 | 0.004479 | 0.245580 | 0.000501 | 0.181810 | 0.007620 |
| 5 | 0.000000 | 0.000000 | 1.438309e-06 | 4.276596e-09 | 0.000192 | 3.842459e-06 | NaN | NaN | NaN | 0.077710 | ... | 0.087977 | 0.036364 | 0.014053 | 0.000696 | 0.106853 | 0.006569 | 0.163975 | 0.023337 | 0.105912 | 0.078165 |
| 6 | 0.333333 | 0.000000 | 2.470448e-06 | 6.808511e-09 | 0.000346 | NaN | NaN | NaN | NaN | 0.185276 | ... | 0.252199 | 0.054545 | 0.008515 | 0.000167 | 0.113833 | 0.006569 | 0.204570 | 0.019331 | 0.136067 | 0.040119 |
| 7 | 0.000000 | 0.000000 | NaN | NaN | 0.043964 | NaN | NaN | NaN | NaN | 0.169796 | ... | 0.425220 | 0.127273 | 0.015665 | 0.000358 | 0.288260 | 0.020305 | 0.381694 | 0.001002 | 0.529028 | 0.034115 |
| 8 | 0.000000 | 0.000000 | 7.505195e-08 | 1.659572e-11 | 0.000025 | 2.305476e-06 | NaN | NaN | NaN | 0.016369 | ... | 0.193548 | 0.090909 | 0.010761 | 0.000171 | 0.115785 | 0.000000 | 0.197964 | 0.026943 | 0.139493 | 0.029754 |
| 9 | 0.000000 | 0.142857 | 1.125921e-06 | 4.468085e-09 | 0.000177 | 7.684918e-07 | NaN | NaN | NaN | 0.035795 | ... | 0.351906 | 0.090909 | 0.008043 | 0.000184 | 0.136976 | 0.006569 | 0.233899 | 0.000501 | 0.189890 | 0.005952 |
10 rows × 36 columns
get_cols_info()
imputer = KNNImputer(n_neighbors=70)
df = pd.DataFrame(imputer.fit_transform(df),columns = df.columns, index = df.index)
get_cols_info()
df.isna().sum().sum()
0
field_cols = [
'pl_orbsmax',
'pl_bmasse',
'pl_bmassj',
'sy_dist',
'st_teff',
'st_rad',
'st_mass',
'st_met',
'st_logg',
'sy_vmag',
'sy_kmag',
'sy_gaiamag',
'pl_rade',
'pl_orbper',
]
A special case arises when a value such as metallicity is exactly 0: the relative error cannot be computed by division, so the value is replaced with half of its error margin
overall_error_margins = {i:0 for i in df.index}
for i in df.index:
    for col in field_cols:
        if df.at[i, col] == 0:
            # avoid dividing by zero: use half of the error margin as the value
            df.at[i, col] = df.at[i, col + 'err1'] / 2
            overall_error_margins[i] = (overall_error_margins[i] + df.at[i, col]) / 2
        else:
            margin = df.at[i, col + 'err1'] / df.at[i, col]
            overall_error_margins[i] = (overall_error_margins[i] + margin) / 2
# rows whose running error margin exceeds 0.5; they are only collected, the frame keeps all of its rows
high_error_rows = [k for k, v in overall_error_margins.items() if v > 0.5]
df = pd.DataFrame(scaler.inverse_transform(df), columns = df.columns, index=df.index)
df_habitable_zone = df[["pl_orbsmax","st_rad","st_teff"]].copy()
# stellar radius in metres (1 solar radius = 6.957e8 m)
df_habitable_zone["st_rad_meters"] = df_habitable_zone["st_rad"]*6.957*10**8
df_habitable_zone["st_rad_squared"]=df_habitable_zone["st_rad_meters"]**2
df_habitable_zone["st_teff_power4"]=df_habitable_zone["st_teff"]**4
# Stefan-Boltzmann law: L = 4 * pi * R^2 * sigma * T^4, in watts
df_habitable_zone['st_luminosity'] = 4*math.pi*df_habitable_zone["st_rad_squared"]*5.67*10**(-8)*df_habitable_zone["st_teff_power4"]
The calculated luminosity is divided by the Sun's luminosity
df_habitable_zone["st_luminosity_reduced"] = df_habitable_zone["st_luminosity"]/(3.828*10**26)
The minimum and maximum distances of the habitable zone are calculated
df_habitable_zone["st_luminosity_reduced_sqrt"] = df_habitable_zone["st_luminosity_reduced"]**(1/2)
df_habitable_zone["min_distance"] = df_habitable_zone["st_luminosity_reduced_sqrt"]*0.95
df_habitable_zone["max_distance"] = df_habitable_zone["st_luminosity_reduced_sqrt"]*1.37
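The cells above first apply the Stefan-Boltzmann law to obtain the stellar luminosity and then scale the 0.95 and 1.37 factors by the square root of the luminosity in solar units. As a worked check, the sketch below repeats the calculation for the Sun with the same constants; the function name `habitable_zone_bounds` is introduced here for illustration and is not part of the notebook.

```python
import math

R_SUN_M = 6.957e8   # solar radius in metres
SIGMA = 5.67e-8     # Stefan-Boltzmann constant in W m^-2 K^-4
L_SUN_W = 3.828e26  # solar luminosity in watts

def habitable_zone_bounds(st_rad, st_teff, inner=0.95, outer=1.37):
    """Return (luminosity in solar units, inner edge, outer edge), edges in AU.

    st_rad is the stellar radius in solar radii, st_teff the effective
    temperature in kelvin.
    """
    radius_m = st_rad * R_SUN_M
    luminosity_w = 4 * math.pi * radius_m**2 * SIGMA * st_teff**4
    luminosity_solar = luminosity_w / L_SUN_W
    scale = math.sqrt(luminosity_solar)
    return luminosity_solar, inner * scale, outer * scale

# The Sun, with the same radius and temperature used for the Earth row.
lum, d_min, d_max = habitable_zone_bounds(st_rad=1.0, st_teff=5780.0)
earth_in_zone = d_min < 1.0 < d_max
print(f"L = {lum:.4f} L_sun, zone = {d_min:.4f} to {d_max:.4f} AU, Earth inside: {earth_in_zone}")
```

With these inputs the luminosity comes out slightly above one solar unit, so Earth at 1 AU falls just inside the inner edge of the zone.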
Finding planets within the habitable zone
bigger_than_min = df_habitable_zone[df_habitable_zone['pl_orbsmax']>df_habitable_zone['min_distance']]
in_the_zone = bigger_than_min[bigger_than_min['pl_orbsmax']<bigger_than_min['max_distance']]
in_the_zone
| | pl_orbsmax | st_rad | st_teff | st_rad_meters | st_rad_squared | st_teff_power4 | st_luminosity | st_luminosity_reduced | st_luminosity_reduced_sqrt | min_distance | max_distance |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | 0.77330 | 0.960000 | 5250.000000 | 6.678720e+08 | 4.460530e+17 | 7.596914e+14 | 2.414441e+26 | 0.630732 | 0.794186 | 0.754477 | 1.088035 |
| 36 | 1.90000 | 2.300000 | 4792.000000 | 1.600110e+09 | 2.560352e+18 | 5.273115e+14 | 9.619663e+26 | 2.512974 | 1.585236 | 1.505974 | 2.171773 |
| 48 | 0.78000 | 0.860000 | 4864.000000 | 5.983020e+08 | 3.579653e+17 | 5.597244e+14 | 1.427605e+26 | 0.372938 | 0.610686 | 0.580152 | 0.836640 |
| 55 | 0.83000 | 1.143000 | 5004.000000 | 7.951851e+08 | 6.323193e+17 | 6.270024e+14 | 2.824871e+26 | 0.737950 | 0.859040 | 0.816088 | 1.176885 |
| 66 | 0.68000 | 1.022714 | 4746.000000 | 7.115023e+08 | 5.062356e+17 | 5.073538e+14 | 1.830023e+26 | 0.478062 | 0.691421 | 0.656850 | 0.947246 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4655 | 0.02817 | 0.120000 | 2559.000000 | 8.348400e+07 | 6.969578e+15 | 4.288260e+13 | 2.129514e+23 | 0.000556 | 0.023586 | 0.022407 | 0.032313 |
| 4678 | 5.90000 | 2.734000 | 7410.989286 | 1.902044e+09 | 3.617771e+18 | 3.016510e+15 | 7.775686e+27 | 20.312658 | 4.506957 | 4.281609 | 6.174531 |
| 4797 | 1.07000 | 0.870000 | 5545.000000 | 6.052590e+08 | 3.663385e+17 | 9.453795e+14 | 2.467639e+26 | 0.644629 | 0.802888 | 0.762743 | 1.099956 |
| 4804 | 1.36000 | 1.150000 | 5576.000000 | 8.000550e+08 | 6.400880e+17 | 9.666985e+14 | 4.408833e+26 | 1.151733 | 1.073188 | 1.019529 | 1.470268 |
| 4931 | 1.00000 | 1.000000 | 5780.000000 | 6.957000e+08 | 4.839985e+17 | 1.116121e+15 | 3.849003e+26 | 1.005487 | 1.002740 | 0.952603 | 1.373753 |
111 rows × 11 columns
df['pl_zone_flag'] = 0
df.iloc[in_the_zone.index,df.columns.get_loc('pl_zone_flag')] = 1
df.iloc[in_the_zone.index]
| | sy_snum | sy_pnum | pl_orbper | pl_orbpererr1 | pl_orbsmax | pl_orbsmaxerr1 | pl_rade | pl_radeerr1 | pl_radj | pl_bmasse | ... | st_loggerr1 | sy_dist | sy_disterr1 | sy_vmag | sy_vmagerr1 | sy_kmag | sy_kmagerr1 | sy_gaiamag | sy_gaiamagerr1 | pl_zone_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | 2.0 | 5.0 | 260.910000 | 0.360000 | 0.77330 | 0.001000 | 2.831871 | 0.210214 | 0.264714 | 47.00510 | ... | 0.050000 | 12.585500 | 0.012400 | 5.950840 | 0.023000 | 4.015000 | 0.0360 | 5.729730 | 0.000852 | 1 |
| 36 | 1.0 | 2.0 | 763.000000 | 17.000000 | 1.90000 | 0.100000 | 8.269071 | 0.873386 | 0.737986 | 826.30000 | ... | 0.100000 | 19.810100 | 0.164800 | 3.950000 | 0.030000 | 1.564000 | 0.2520 | 3.678530 | 0.004135 | 1 |
| 48 | 1.0 | 1.0 | 268.940000 | 0.990000 | 0.78000 | 0.050000 | 7.600829 | 0.626957 | 0.678786 | 330.54320 | ... | 0.290000 | 49.352000 | 0.084600 | 9.780000 | 0.030000 | 7.336000 | 0.0230 | 9.351080 | 0.000253 | 1 |
| 55 | 1.0 | 1.0 | 307.880000 | 1.470000 | 0.83000 | 0.040000 | 7.412557 | 0.565229 | 0.661286 | 432.24663 | ... | 0.130000 | 53.622000 | 0.191900 | 9.670000 | 0.030000 | 7.557000 | 0.0240 | 9.412660 | 0.000255 | 1 |
| 66 | 1.0 | 2.0 | 237.600000 | 1.500000 | 0.68000 | 0.020000 | 5.331314 | 0.557814 | 0.475629 | 104.00000 | ... | 0.260000 | 41.334200 | 0.068500 | 9.860000 | 0.030000 | 7.323000 | 0.0210 | 9.503310 | 0.000418 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4655 | 1.0 | 7.0 | 6.099615 | 0.000011 | 0.02817 | 0.000830 | 0.918000 | 0.039000 | 0.082000 | 0.62000 | ... | 0.083143 | 701.565214 | 11.812229 | 17.020000 | 0.200000 | 10.296000 | 0.0230 | 15.645100 | 0.001448 | 1 |
| 4678 | 2.0 | 2.0 | 5840.000000 | 1095.000000 | 5.90000 | 1.400000 | 7.225214 | 0.769229 | 0.974786 | 2002.20000 | ... | 0.081571 | 1911.177009 | 290.611748 | 16.775300 | 0.093900 | 14.776000 | 0.1240 | 16.693700 | 0.063232 | 1 |
| 4797 | 1.0 | 2.0 | 421.000000 | 2.000000 | 1.07000 | 0.030000 | 2.837986 | 0.221629 | 0.253900 | 1010.69940 | ... | 0.020000 | 163.371000 | 1.067000 | 11.632000 | 0.018000 | 9.677000 | 0.0190 | 11.282300 | 0.001447 | 1 |
| 4804 | 1.0 | 4.0 | 572.000000 | 7.000000 | 1.36000 | 0.040000 | 2.401743 | 0.554943 | 0.207971 | 394.10920 | ... | 0.050000 | 264.780000 | 4.888000 | 11.936000 | 0.046000 | 10.192000 | 0.0260 | 11.788900 | 0.000253 | 1 |
| 4931 | 1.0 | 8.0 | 365.000000 | 0.000463 | 1.00000 | 0.006004 | 1.000000 | 0.793886 | 0.378700 | 1.00000 | ... | 0.086714 | 767.962063 | 19.819703 | 14.800471 | 0.187057 | 12.307871 | 0.0308 | 14.514798 | 0.000519 | 1 |
111 rows × 37 columns
in_habitable = df.iloc[in_the_zone.index]
df = pd.DataFrame(scaler.fit_transform(df), columns = df.columns, index=df.index)
df = df.drop(columns=[col + 'err1' for col in field_cols])
df.columns
Index(['sy_snum', 'sy_pnum', 'pl_orbper', 'pl_orbsmax', 'pl_rade', 'pl_radj',
'pl_bmasse', 'pl_bmassj', 'pl_orbeccen', 'pl_insol', 'pl_insolerr1',
'pl_eqt', 'ttv_flag', 'st_teff', 'st_rad', 'st_mass', 'st_met',
'st_logg', 'sy_dist', 'sy_vmag', 'sy_kmag', 'sy_gaiamag',
'pl_zone_flag'],
dtype='object')
df_planets = df.loc[:,df.columns.str.startswith('pl')].copy() # for clustering only planet data
df_stellar = df.loc[:,df.columns.str.startswith('st')].copy() # for clustering only stellar data
df.head(10)
| | sy_snum | sy_pnum | pl_orbper | pl_orbsmax | pl_rade | pl_radj | pl_bmasse | pl_bmassj | pl_orbeccen | pl_insol | ... | st_teff | st_rad | st_mass | st_met | st_logg | sy_dist | sy_vmag | sy_kmag | sy_gaiamag | pl_zone_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.333333 | 0.000000 | 8.105728e-07 | 0.000171 | 0.002763 | 0.145104 | 0.348966 | 0.348983 | 0.250814 | 0.019658 | ... | 0.073760 | 0.226638 | 0.114225 | 0.597345 | 0.165179 | 0.010443 | 0.091525 | 0.164578 | 0.085499 | 0.0 |
| 1 | 0.000000 | 0.000000 | 1.283682e-06 | 0.000204 | 0.002978 | 0.147406 | 0.188874 | 0.188882 | 0.086862 | 0.007888 | ... | 0.066635 | 0.287266 | 0.076008 | 0.769912 | 0.059524 | 0.014095 | 0.098664 | 0.153560 | 0.092572 | 0.0 |
| 2 | 0.000000 | 0.000000 | 4.618415e-07 | 0.000110 | 0.002407 | 0.122121 | 0.086341 | 0.086345 | 0.000000 | 0.008679 | ... | 0.075018 | 0.131161 | 0.092994 | 0.646018 | 0.212798 | 0.008426 | 0.104040 | 0.166152 | 0.113227 | 0.0 |
| 3 | 0.000000 | 0.142857 | 4.393608e-06 | 0.000376 | 0.002209 | 0.109308 | 0.087244 | 0.087245 | 0.398914 | 0.004890 | ... | 0.083915 | 0.011815 | 0.040764 | 0.931416 | 0.480655 | 0.001890 | 0.138216 | 0.242698 | 0.198325 | 0.0 |
| 4 | 0.666667 | 0.000000 | 1.988359e-06 | 0.000223 | 0.002522 | 0.124687 | 0.030219 | 0.030220 | 0.748100 | 0.007675 | ... | 0.091573 | 0.013367 | 0.043312 | 0.778761 | 0.468750 | 0.002255 | 0.128260 | 0.240674 | 0.179814 | 0.0 |
| 5 | 0.000000 | 0.000000 | 1.438309e-06 | 0.000192 | 0.003021 | 0.149514 | 0.077710 | 0.077711 | 0.065147 | 0.007749 | ... | 0.063391 | 0.309225 | 0.051380 | 0.747788 | 0.074405 | 0.014053 | 0.103910 | 0.158539 | 0.103730 | 0.0 |
| 6 | 0.333333 | 0.000000 | 2.470448e-06 | 0.000346 | 0.003247 | 0.160836 | 0.185276 | 0.185284 | 0.086862 | 0.010599 | ... | 0.077961 | 0.101325 | 0.097240 | 0.729204 | 0.241071 | 0.008515 | 0.110913 | 0.199398 | 0.133959 | 0.0 |
| 7 | 0.000000 | 0.000000 | 2.377407e-08 | 0.043964 | 0.000513 | 0.028659 | 0.169796 | 0.143910 | 0.062494 | 0.021936 | ... | 0.060484 | 0.015515 | 0.025902 | 0.744014 | 0.416667 | 0.015665 | 0.285914 | 0.377674 | 0.527879 | 0.0 |
| 8 | 0.000000 | 0.000000 | 7.505195e-08 | 0.000025 | 0.002386 | 0.121190 | 0.016369 | 0.016369 | 0.045603 | 0.007481 | ... | 0.076436 | 0.126865 | 0.041614 | 0.411504 | 0.181548 | 0.010761 | 0.112871 | 0.192749 | 0.137394 | 0.0 |
| 9 | 0.000000 | 0.142857 | 1.125921e-06 | 0.000177 | 0.001834 | 0.090893 | 0.035795 | 0.035797 | 0.097720 | 0.005288 | ... | 0.080070 | 0.058360 | 0.064968 | 0.738938 | 0.342262 | 0.008043 | 0.134132 | 0.228918 | 0.187913 | 0.0 |
10 rows × 23 columns
df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4934 entries, 0 to 4933
Columns: 23 entries, sy_snum to pl_zone_flag
dtypes: float64(23)
memory usage: 886.7 KB
inertias = []
K = range(1, 10)
for k in K:
    # Building and fitting the model (a single fit per k is enough)
    kmeanModel = KMeans(n_clusters=k).fit(df)
    inertias.append(kmeanModel.inertia_)
plt.plot(K, inertias, 'o-')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.grid(True, alpha=0.3, linestyle="--")
plt.show()
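The elbow in an inertia plot can be hard to read, so a second criterion is sometimes useful. The sketch below is not part of the original analysis: it assumes synthetic, well-separated blobs from `make_blobs` in place of the planet data and picks the number of clusters with the highest silhouette score from `sklearn.metrics.silhouette_score`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated groups in 2D.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# The silhouette score needs at least two clusters, so k starts at 2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On real data the maximum is usually less pronounced than on these blobs, so the score is better read alongside the elbow plot than instead of it.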
Performing principal component analysis in order to show the outcome of the clustering in 2D graphs
reduced_data = PCA(n_components=2).fit_transform(df)
results_df = pd.DataFrame(reduced_data,columns=['pca1','pca2'],index = df.index)
results_df.head()
| | pca1 | pca2 |
|---|---|---|
| 0 | -0.424423 | 0.529628 |
| 1 | -0.373081 | 0.451835 |
| 2 | -0.312190 | 0.363155 |
| 3 | -0.304828 | 0.440353 |
| 4 | -0.454854 | 0.644464 |
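Since the clustering is later run on the two principal components, it can help to check how much of the variance they retain. The sketch below assumes random stand-in data with five features of very different scale; only `PCA` and its `explained_variance_ratio_` attribute come from scikit-learn, everything else is made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 made-up samples with 5 features whose standard deviations differ widely.
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 0.5, 0.3, 0.1])

pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_  # share of variance per component
retained = ratios.sum()                 # share kept by the 2D projection

print(ratios)
print(f"variance retained by two components: {retained:.3f}")
```

If `retained` is low for the real data, clusters found in the 2D projection may not reflect the structure of the full feature space.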
reduced_data_planets = PCA(n_components=2).fit_transform(df_planets)
results_planets = pd.DataFrame(reduced_data_planets,columns=['pca1','pca2'],index = df_planets.index)
results_planets.head()
| | pca1 | pca2 |
|---|---|---|
| 0 | 0.152635 | -0.207132 |
| 1 | 0.023081 | -0.079000 |
| 2 | -0.052293 | 0.004291 |
| 3 | 0.215823 | -0.244640 |
| 4 | 0.452783 | -0.484134 |
reduced_data_stellar = PCA(n_components=2).fit_transform(df_stellar)
results_stellar = pd.DataFrame(reduced_data_stellar,columns=['pca1','pca2'],index = df_stellar.index)
df_stellar.head()
| | st_teff | st_rad | st_mass | st_met | st_logg |
|---|---|---|---|---|---|
| 0 | 0.073760 | 0.226638 | 0.114225 | 0.597345 | 0.165179 |
| 1 | 0.066635 | 0.287266 | 0.076008 | 0.769912 | 0.059524 |
| 2 | 0.075018 | 0.131161 | 0.092994 | 0.646018 | 0.212798 |
| 3 | 0.083915 | 0.011815 | 0.040764 | 0.931416 | 0.480655 |
| 4 | 0.091573 | 0.013367 | 0.043312 | 0.778761 | 0.468750 |
results_habitable = results_df.iloc[in_habitable.index]
results_habitable
| | pca1 | pca2 |
|---|---|---|
| 31 | -0.204516 | 0.581858 |
| 36 | -0.376450 | 0.610536 |
| 48 | -0.278978 | 0.328454 |
| 55 | -0.238870 | 0.248948 |
| 66 | -0.209575 | 0.296518 |
| ... | ... | ... |
| 4655 | 1.013363 | 0.501412 |
| 4678 | 0.008089 | -0.123835 |
| 4797 | -0.157946 | 0.218952 |
| 4804 | -0.044543 | 0.195879 |
| 4931 | 1.043370 | 0.534803 |
111 rows × 2 columns
kmeans = KMeans(n_clusters=3, n_init=100, max_iter=500)
kmeans.fit(results_df)
y = kmeans.predict(results_df)
centers = kmeans.cluster_centers_
Put clusters inside dataframe as a column
results_df['clusters'] = y
results_df.head()
| | pca1 | pca2 | clusters |
|---|---|---|---|
| 0 | -0.424423 | 0.529628 | 2 |
| 1 | -0.373081 | 0.451835 | 2 |
| 2 | -0.312190 | 0.363155 | 2 |
| 3 | -0.304828 | 0.440353 | 2 |
| 4 | -0.454854 | 0.644464 | 2 |
Checking for number of planets in each cluster
results_df.clusters.value_counts()
0    3314
2    1281
1     339
Name: clusters, dtype: int64
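The cluster numbers in these counts are arbitrary labels, and since the `KMeans` calls in this notebook do not set a seed, the numbering (and possibly the split) may change between runs. A minimal sketch, on synthetic data, of pinning the result with `random_state`; the seed value 42 is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((200, 2))  # synthetic stand-in for the PCA coordinates (assumption)

# Same data, same seed: the two runs produce identical labels
labels_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

same = np.array_equal(labels_a, labels_b)
counts = np.bincount(labels_a, minlength=3)  # cluster sizes, like value_counts()
print(same, counts)
```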
kmeans = KMeans(n_clusters=3, n_init=100, max_iter=500)  # TODO: check exactly what n_init="auto" does
kmeans.fit(results_planets)
y = kmeans.predict(results_planets)
pl_centers = kmeans.cluster_centers_
Put clusters inside dataframe as a column
results_planets['clusters'] = y
results_planets.head()
| | pca1 | pca2 | clusters |
|---|---|---|---|
| 0 | 0.152635 | -0.207132 | 2 |
| 1 | 0.023081 | -0.079000 | 0 |
| 2 | -0.052293 | 0.004291 | 0 |
| 3 | 0.215823 | -0.244640 | 2 |
| 4 | 0.452783 | -0.484134 | 2 |
Checking for number of planets in each cluster
results_planets.clusters.value_counts()
0    4289
2     534
1     111
Name: clusters, dtype: int64
kmeans = KMeans(n_clusters=3, n_init=100, max_iter=500)  # TODO: check exactly what n_init="auto" does
kmeans.fit(results_stellar)
y = kmeans.predict(results_stellar)
st_centers = kmeans.cluster_centers_
Put clusters inside dataframe as a column
results_stellar['clusters'] = y
results_stellar.head()
| | pca1 | pca2 | clusters |
|---|---|---|---|
| 0 | 0.108420 | 0.385679 | 1 |
| 1 | -0.069595 | 0.487771 | 1 |
| 2 | 0.063617 | 0.290262 | 1 |
| 3 | -0.199853 | -0.023429 | 2 |
| 4 | -0.048772 | -0.002164 | 2 |
Checking for number of planets in each cluster
results_stellar.clusters.value_counts()
2    2538
0    2257
1     139
Name: clusters, dtype: int64
Hierarchical (agglomerative) clustering with Ward linkage on the same PCA coordinates will serve as a point of comparison for the K-Means results
data = list(zip(results_df["pca1"], results_df["pca2"]))
linkage_data = linkage(data, method='ward', metric='euclidean')
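# Ward linkage always uses Euclidean distance; the `affinity` argument is omitted because newer scikit-learn releases no longer accept it
hierarchical_cluster = AgglomerativeClustering(n_clusters=3, linkage='ward')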
labels = hierarchical_cluster.fit_predict(data)
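One way to put a number on the agreement between the two clusterings is the adjusted Rand index, which ignores how the clusters happen to be numbered. The sketch below runs on synthetic blobs, not on the notebook's data, and assumes that a Rand-type score is an acceptable measure for this comparison:

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the PCA coordinates (assumption: 3 well-separated groups)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
ward_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# 1.0 means identical partitions; values near 0 mean chance-level agreement
ari = adjusted_rand_score(kmeans_labels, ward_labels)
# The score is invariant to relabelling: these two partitions are the same
relabelled = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
print(ari, relabelled)
```

In the notebook, the same call could take `results_df['clusters']` and `labels` as its two arguments.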
colors = ['#DF2020', '#81DF20', '#2095DF']
def plot_clusters(df: pd.DataFrame, title: str, centers) -> pd.DataFrame:
    """Plot the clusters and return the planets that share Earth's cluster."""
    df['c'] = df['clusters'].map({0: colors[0], 1: colors[1], 2: colors[2]})
    fig = plt.gcf()
    fig.set_size_inches(16,9)
    sns.scatterplot(x="pca1", y="pca2", hue=df['clusters'], data=df, alpha=0.5, palette=colors)
    plt.title(title)
    pca1 = df['pca1']
    pca2 = df['pca2']
    # plot pca1 mean
    plt.plot([pca1.mean()] * 2, [-pca1.max(),pca1.max()], color='black', lw=0.5, linestyle='--')
    # plot pca2 mean
    plt.plot([-pca2.max(),pca2.max()], [pca2.mean()]*2, color='black', lw=0.5, linestyle='--')
    # create a list of legend elements
    ## average line
    legend_elements = [Line2D([0], [0], color='black', lw=0.5, linestyle='--', label='Average')]
    ## markers / records
    cluster_leg = [Line2D([0], [0], marker='o', color='w', label='Cluster {}'.format(i+1),
                          markerfacecolor=mcolor, markersize=5) for i, mcolor in enumerate(colors)]
    ## centroids
    cent_leg = [Line2D([0], [0], marker='^', color='w', label='Centroid - C{}'.format(i+1),
                       markerfacecolor=mcolor, markersize=10) for i, mcolor in enumerate(colors)]
    # add all elements to the same list
    legend_elements.extend(cluster_leg)
    legend_elements.extend(cent_leg)
    for i in df.clusters.unique():
        # get the convex hull
        points = df[df.clusters == i][['pca1','pca2']].values
        hull = ConvexHull(points)
        x_hull = np.append(points[hull.vertices,0],
                           points[hull.vertices,0][0])
        y_hull = np.append(points[hull.vertices,1],
                           points[hull.vertices,1][0])
        # interpolate
        dist = np.sqrt((x_hull[:-1] - x_hull[1:])**2 + (y_hull[:-1] - y_hull[1:])**2)
        dist_along = np.concatenate(([0], dist.cumsum()))
        spline, u = interpolate.splprep([x_hull, y_hull],
                                        u=dist_along, s=0, per=1)
        interp_d = np.linspace(dist_along[0], dist_along[-1], 50)
        interp_x, interp_y = interpolate.splev(interp_d, spline)
        # plot shape
        plt.fill(interp_x, interp_y, '--', c=colors[i], alpha=0.2)
    # get the row indices of Earth, Mars and Venus
    e_index = label_df[label_df['pl_name'] == 'Earth'].index
    m_index = label_df[label_df['pl_name'] == 'Mars'].index
    v_index = label_df[label_df['pl_name'] == 'Venus'].index
    # Earth's cluster
    e_cluster = df.iloc[e_index].clusters.values[0]
    # scatter centroids
    plt.scatter(centers[:, 0], centers[:, 1], c=colors, marker = '^', s=100,edgecolors= "black")
    # add Earth, Mars and Venus
    plt.scatter(df.iloc[e_index]['pca1'],df.iloc[e_index]['pca2'], c='darkblue',edgecolors= "black")
    plt.scatter(df.iloc[m_index]['pca1'],df.iloc[m_index]['pca2'], c='orange',edgecolors= "black")
    plt.scatter(df.iloc[v_index]['pca1'],df.iloc[v_index]['pca2'], c='lightblue',edgecolors= "black")
    # Earth, Mars and Venus legend entries
    earth = [Line2D([0], [0], marker='o', color='w', label=name,
                    markerfacecolor=mcolor, markersize=10) for name, mcolor in {'Earth':'darkblue','Mars':'orange','Venus':'lightblue'}.items()]
    legend_elements.extend(earth)
    # plot legend
    plt.legend(handles=legend_elements, loc='upper right', ncol=2)
    plt.show()
    # pick planets that share the same cluster as Earth
    return_df = label_df.iloc[df[df['clusters'] == e_cluster].index]
    # drop Earth itself
    return_df = return_df.drop(index=e_index)
    return return_df
overall_clustered = plot_clusters(results_df,'Default Attributes Based K-Means Clustering',centers)
overall_clustered
| | pl_name | hostname |
|---|---|---|
| 45 | AU Mic b | AU Mic |
| 89 | CoRoT-2 b | CoRoT-2 |
| 296 | HAT-P-13 b | HAT-P-13 |
| 855 | HD 28109 c | HD 28109 |
| 856 | HD 28109 d | HD 28109 |
| ... | ... | ... |
| 4799 | WASP-43 b | WASP-43 |
| 4803 | WASP-47 b | WASP-47 |
| 4805 | WASP-47 d | WASP-47 |
| 4932 | Venus | Sol |
| 4933 | Mars | Sol |
338 rows × 2 columns
fig = plt.gcf()
fig.set_size_inches(16,9)
dendrogram(linkage_data)
plt.title("Dendrogram")
plt.show()
data_x=[i[0] for i in data]
data_y=[i[1] for i in data]
fig = plt.gcf()
fig.set_size_inches(16,9)
plt.scatter(data_x, data_y, c=labels, alpha=0.5)
plt.title("Hierarchical Clustering")
plt.show()
pl_clustered = plot_clusters(results_planets, 'Planetary Attribute Based K-Means Clustering', pl_centers)
pl_clustered
| | pl_name | hostname |
|---|---|---|
| 31 | 55 Cnc f | 55 Cnc |
| 36 | 7 CMa b | 7 CMa |
| 48 | BD+14 4559 b | BD+14 4559 |
| 55 | BD+45 564 b | BD+45 564 |
| 66 | BD-08 2823 c | BD-08 2823 |
| ... | ... | ... |
| 4637 | TOI-700 d | TOI-700 |
| 4655 | TRAPPIST-1 e | TRAPPIST-1 |
| 4678 | UZ For b | UZ For |
| 4797 | WASP-41 c | WASP-41 |
| 4804 | WASP-47 c | WASP-47 |
110 rows × 2 columns
st_clustered = plot_clusters(results_stellar, 'Host Star Based Clustering', st_centers)
st_clustered
| | pl_name | hostname |
|---|---|---|
| 3 | 14 Her b | 14 Her |
| 4 | 16 Cyg B b | 16 Cyg B |
| 7 | 1RXS J160929.1-210524 b | 1RXS J160929.1-210524 |
| 9 | 24 Sex b | 24 Sex |
| 10 | 24 Sex c | 24 Sex |
| ... | ... | ... |
| 4926 | ups And b | ups And |
| 4927 | ups And c | ups And |
| 4928 | ups And d | ups And |
| 4932 | Venus | Sol |
| 4933 | Mars | Sol |
2537 rows × 2 columns
# scatter plot of distribution of planets in habitable zone among all other planets
fig = plt.gcf()
fig.set_size_inches(16,9)
sns.scatterplot(x="pca1", y="pca2", data=results_df, alpha=0.5, color = 'gray')
plt.title('Planets within Habitable Radius')
plt.scatter(results_habitable['pca1'],results_habitable['pca2'], c='green', alpha=0.5)
plt.show()
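The scatter plot shows where the habitable-zone planets fall, but not what share of each cluster they make up. A minimal sketch of that calculation on a toy frame; the values and the `in_hz` column name are made up for illustration, and `habitable_index` stands in for `in_habitable.index`:

```python
import pandas as pd

# Toy stand-in for results_df after clustering (assumption: made-up labels)
toy = pd.DataFrame({"clusters": [0, 0, 0, 1, 1, 2, 2, 2]})
# Toy stand-in for in_habitable.index: row labels of habitable-zone planets
habitable_index = [1, 3, 4, 7]

# Flag the rows that lie in the habitable zone
toy["in_hz"] = toy.index.isin(habitable_index)
# Fraction of each cluster's planets that lie in the habitable zone
share = toy.groupby("clusters")["in_hz"].mean()
print(share)
```

If one cluster held a clearly higher share than the others, that would support using cluster membership as a coarse filter.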
Planets that share Earth's cluster on planetary attributes and whose host stars share Sol's cluster on stellar attributes
final_df = pd.concat([pl_clustered, st_clustered], axis=1, join="inner")
final_df
| | pl_name | hostname | pl_name | hostname |
|---|---|---|---|---|
| 31 | 55 Cnc f | 55 Cnc | 55 Cnc f | 55 Cnc |
| 36 | 7 CMa b | 7 CMa | 7 CMa b | 7 CMa |
| 48 | BD+14 4559 b | BD+14 4559 | BD+14 4559 b | BD+14 4559 |
| 174 | GJ 1148 b | GJ 1148 | GJ 1148 b | GJ 1148 |
| 193 | GJ 229 b | GJ 229 | GJ 229 b | GJ 229 |
| ... | ... | ... | ... | ... |
| 4541 | TOI-1516 b | TOI-1516 | TOI-1516 b | TOI-1516 |
| 4655 | TRAPPIST-1 e | TRAPPIST-1 | TRAPPIST-1 e | TRAPPIST-1 |
| 4678 | UZ For b | UZ For | UZ For b | UZ For |
| 4797 | WASP-41 c | WASP-41 | WASP-41 c | WASP-41 |
| 4804 | WASP-47 c | WASP-47 | WASP-47 c | WASP-47 |
64 rows × 4 columns
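The inner join with `pd.concat` keeps the rows whose index appears in both frames, but it repeats the `pl_name` and `hostname` columns. An alternative sketch, using toy frames in place of `pl_clustered` and `st_clustered`, intersects the indices directly and keeps a single copy of the columns:

```python
import pandas as pd

# Toy stand-ins for pl_clustered and st_clustered (assumption: made-up rows)
pl = pd.DataFrame(
    {"pl_name": ["A b", "B c", "C d"], "hostname": ["A", "B", "C"]},
    index=[31, 36, 48],
)
st = pd.DataFrame(
    {"pl_name": ["B c", "C d", "D e"], "hostname": ["B", "C", "D"]},
    index=[36, 48, 55],
)

common = pl.index.intersection(st.index)  # row labels present in both frames
both = pl.loc[common]                     # one copy of each column
print(both)
```

Both approaches select the same rows; the difference is only in how many columns the result carries.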
The same intersection, further restricted to planets that also share Earth's cluster in the default (all-attribute) clustering
final_df = pd.concat([pl_clustered, st_clustered, overall_clustered], axis=1, join="inner")
final_df
| | pl_name | hostname | pl_name | hostname | pl_name | hostname |
|---|---|---|---|---|---|---|
| 4655 | TRAPPIST-1 e | TRAPPIST-1 | TRAPPIST-1 e | TRAPPIST-1 | TRAPPIST-1 e | TRAPPIST-1 |
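TRAPPIST-1 e is the only planet that shares Earth's cluster in all three clusterings (planetary, stellar, and overall), which makes it the strongest habitability candidate this analysis identifies in the dataset. The other 63 planets from the two-way intersection above are also worth a closer look. As noted in the problem statement, these results serve as a filter and should not be taken at face value.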
Every source used to create this material is referenced below, from guides on how to use technologies to calculations.
Undergraduate Research Center, How to Write an Abstract?
Datacamp, Unsupervised Machine Learning in Python
Probability Density Function or Kernel Density Function?
Understanding Probability Mass Function
KNN Imputation in Machine Learning
Optimal K value in KNN Imputation
How to utilize the Elbow Method for Finding Optimal Cluster Size
A Complete Guide to Box Charts
Visualizing Clusters with Python’s Matplotlib
Astronomical Magnitude Systems
Metallicity, Wikipedia
Habitable Zone, NASA
Luminosity Calculation
Distance According to Luminosity
Principal Component Analysis
PCA Explained
Why PCA Is Used?
Types of Clusterings
Hierarchical Clustering
Rand Score
Scatter Plot
Disclaimer! This notebook was prepared by Hikmet Güner and Deniz Erkin Kasaplı as a term project for the BBM467 - Data Intensive Applications class. The notebook is available for educational purposes only. There is no guarantee of the correctness of the content provided, as it is student work.
If you think there is any copyright violation, please let us know.